6/4/2019

Introduction to R & the Tidyverse

Some details …

  1. Quick tour of R
  2. Tidyverse
  3. ggplot2

Information

R Markdown

  • Notebook structure where text, code, and output are combined
  • R Markdown files are plain text, so they are easy to port
  • Uses Markdown language to format
  • Plays nicely with RStudio and GitHub

R Markdown file

An R Markdown chunk looks like this:

```{r}
  1 + 1
```

But typically, you will want text surrounding your work.

Here I've done some arithmetic as part of my analysis:
```{r}
  1 + 1
```

Tidyverse

Set Up

require(dplyr); require(tidyr)
require(ggplot2); require(lattice)
require(googlesheets)
require(babynames); require(nycflights13); require(NHANES)
Babynames <- babynames
names(NHANES) <- tolower(names(NHANES))

Tidy Data

  • rows (cases/observational units) and
  • columns (variables).

  • The key is that every row is a case and every column is a variable.

  • No exceptions.
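
As a minimal sketch of the rule above, the following reshapes a small untidy table, where the year is hidden in the column names, into one with a row per case. The table and its counts are made up for illustration, and `pivot_longer()` assumes tidyr 1.0.0 or later.

```r
library(tidyr)

# Hypothetical untidy table: one column per year (values are made up)
untidy <- data.frame(
  name   = c("Mary", "Anna"),
  `1880` = c(70, 26),
  `1881` = c(69, 27),
  check.names = FALSE
)

# One row per (name, year) case; year and n become ordinary variables
tidy <- pivot_longer(untidy, cols = c(`1880`, `1881`),
                     names_to = "year", values_to = "n")
tidy
```

Each row of `tidy` is now a single name in a single year, which is the shape the data verbs in this deck expect.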

Chaining

The pipe syntax (%>%) takes a data frame (or data table) and passes it as the first argument of a function.

  • x %>% f(y) is the same as f(x, y)

  • y %>% f(x, ., z) is the same as f(x,y,z)
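
The two equivalences above can be checked directly. This is a small sketch; `f()` is a hypothetical helper defined only for the illustration.

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

# Hypothetical three-argument function, used only for this illustration
f <- function(a, b, c = 0) a - b + c

x <- 10
y <- 3

piped_first <- x %>% f(y)        # x fills the first argument: f(x, y)
piped_dot   <- y %>% f(x, ., 1)  # the dot marks where y goes: f(x, y, 1)

identical(piped_first, f(x, y))
identical(piped_dot, f(x, y, 1))
```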

Building Tidy Data

  • object_name = function_name(arguments)
  • object_name = data_table %>% function_name(arguments)
  • object_name = data_table %>%
                function_name(arguments) %>%
                function_name(arguments)

  • in chaining, the value on the left of %>% becomes the first argument of the function on the right

5 Main Data Verbs

Data verbs take data tables as input and give data tables as output

  1. select(): removes unwanted variables (and rename() )
  2. filter(): removes/subsets unwanted cases
  3. mutate(): transforms the variable (and transmute() like mutate, returns only new variables)
  4. arrange(): reorders the cases
  5. summarize(): computes summary statistics
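
A minimal sketch of the five verbs chained together, using the built-in `mtcars` data rather than the data sets from this session. The variable choices are illustrative only.

```r
library(dplyr)

cars <- mtcars %>%
  select(mpg, cyl, wt) %>%        # keep three variables
  filter(cyl %in% c(4, 6)) %>%    # keep only 4- and 6-cylinder cars
  mutate(wt_lb = wt * 1000) %>%   # wt is recorded in 1000s of pounds
  arrange(desc(mpg))              # highest mpg first

# summarize() collapses the table to one row of summary statistics
cars_summary <- cars %>%
  summarize(n_cars = n(), mean_mpg = mean(mpg))
cars_summary
```

Every step takes a data table in and hands a data table on, which is what makes the chain possible.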

Other Data Verbs

  • distinct(): returns the unique rows in a table
  • sample_n(): takes n random rows
  • head(): grab the first few rows
  • tail(): grab the last few rows
  • group_by(): SUCCESSIVE functions are applied to groups
  • summarise():
    • min(), max(), mean(), sum(), sd(), median(), and IQR()
    • n(): number of observations in the current group
    • n_distinct(): number of unique values
    • first(x), last(x), and nth(x, n): (like x[1], x[length(x)], and x[n] )
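
A short sketch of `group_by()` followed by `summarise()` with several of the helpers listed above, again on the built-in `mtcars` data. The dplyr position helpers are `first()`, `last()`, and `nth()`; the column names on the left of each `=` are made up for the example.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%                 # later verbs work within each cyl group
  summarise(
    n_cars     = n(),               # rows in the current group
    n_gears    = n_distinct(gear),  # unique gear values in the group
    first_mpg  = first(mpg),        # like mpg[1] within the group
    last_mpg   = last(mpg),         # like mpg[length(mpg)]
    second_mpg = nth(mpg, 2)        # like mpg[2]
  )
by_cyl
```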

Finally, some Examples!

Babynames %>% nrow()
## [1] 1924665
Babynames %>% names()
## [1] "year" "sex"  "name" "n"    "prop"

Babynames %>% glimpse()
## Observations: 1,924,665
## Variables: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
## $ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", …
## $ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288,…
## $ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843,…

Babynames %>% head()
## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

Babynames %>% sample_n(size=5)
## # A tibble: 5 x 5
##    year sex   name           n       prop
##   <dbl> <chr> <chr>      <int>      <dbl>
## 1  1992 F     Lacourtney     6 0.00000299
## 2  1893 F     Leonia         8 0.0000355 
## 3  1954 M     Noris          5 0.00000242
## 4  1981 M     Lonnel         6 0.00000322
## 5  1965 M     Autry         15 0.00000791

NHANES Data

names(NHANES)
##  [1] "id"               "surveyyr"         "gender"          
##  [4] "age"              "agedecade"        "agemonths"       
##  [7] "race1"            "race3"            "education"       
## [10] "maritalstatus"    "hhincome"         "hhincomemid"     
## [13] "poverty"          "homerooms"        "homeown"         
## [16] "work"             "weight"           "length"          
## [19] "headcirc"         "height"           "bmi"             
## [22] "bmicatunder20yrs" "bmi_who"          "pulse"           
## [25] "bpsysave"         "bpdiaave"         "bpsys1"          
## [28] "bpdia1"           "bpsys2"           "bpdia2"          
## [31] "bpsys3"           "bpdia3"           "testosterone"    
## [34] "directchol"       "totchol"          "urinevol1"       
## [37] "urineflow1"       "urinevol2"        "urineflow2"      
## [40] "diabetes"         "diabetesage"      "healthgen"       
## [43] "daysphyshlthbad"  "daysmenthlthbad"  "littleinterest"  
## [46] "depressed"        "npregnancies"     "nbabies"         
## [49] "age1stbaby"       "sleephrsnight"    "sleeptrouble"    
## [52] "physactive"       "physactivedays"   "tvhrsday"        
## [55] "comphrsday"       "tvhrsdaychild"    "comphrsdaychild" 
## [58] "alcohol12plusyr"  "alcoholday"       "alcoholyear"     
## [61] "smokenow"         "smoke100"         "smoke100n"       
## [64] "smokeage"         "marijuana"        "agefirstmarij"   
## [67] "regularmarij"     "ageregmarij"      "harddrugs"       
## [70] "sexever"          "sexage"           "sexnumpartnlife" 
## [73] "sexnumpartyear"   "samesex"          "sexorientation"  
## [76] "pregnantnow"

1. select(): removes unwanted variables

find the sleep variables

NHANESsleep <- NHANES %>% 
  select(gender, age, weight, race1, race3, education, sleeptrouble,
         sleephrsnight, tvhrsday, tvhrsdaychild, physactive)
names(NHANESsleep)
##  [1] "gender"        "age"           "weight"        "race1"        
##  [5] "race3"         "education"     "sleeptrouble"  "sleephrsnight"
##  [9] "tvhrsday"      "tvhrsdaychild" "physactive"
dim(NHANESsleep)
## [1] 10000    11

2. filter(): removes/subsets unwanted cases

subset for college students

NHANESsleep <- NHANESsleep %>% filter(age %in% c(18:22))
histogram(~age, data=NHANESsleep)

3. mutate(): transforms the variable

mutate or transmute to create a new variable?

NHANESsleep %>% mutate(weightlb = weight*2.2) %>% head(3)
## # A tibble: 3 x 12
##   gender   age weight race1 race3 education sleeptrouble sleephrsnight
##   <fct>  <int>  <dbl> <fct> <fct> <fct>     <fct>                <int>
## 1 female    21  104.  Black <NA>  Some Col… Yes                      4
## 2 female    21  104.  Black <NA>  Some Col… Yes                      4
## 3 female    22   81.8 Black <NA>  Some Col… No                       5
## # … with 4 more variables: tvhrsday <fct>, tvhrsdaychild <int>,
## #   physactive <fct>, weightlb <dbl>

NHANESsleep %>% transmute(weightlb = weight*2.2) %>% head(3)
## # A tibble: 3 x 1
##   weightlb
##      <dbl>
## 1     228.
## 2     228.
## 3     180.

NHANESsleep <- NHANESsleep %>% mutate(weightlb = weight*2.2)
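
The difference between the two verbs is easiest to see in the column names. A minimal sketch on a made-up three-row table (the `people` data is hypothetical):

```r
library(dplyr)

people <- tibble(id = 1:3, weight = c(60, 75, 90))  # hypothetical weights in kg

kept    <- people %>% mutate(weightlb = weight * 2.2)     # adds a column
dropped <- people %>% transmute(weightlb = weight * 2.2)  # keeps only the new column

names(kept)
names(dropped)
```

Use `mutate()` when the new variable should travel with the rest of the table, as in the assignment above.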

5. summarize(): computes summary statistics

# number of people (cases) in NHANES
NHANES %>% summarise(n())
## # A tibble: 1 x 1
##   `n()`
##   <int>
## 1 10000

5. summarize(): computes summary statistics

# total weight of all the people in NHANES (silly)
NHANES %>% 
  mutate(weightlb = weight*2.2) %>% 
  summarise(sum(weightlb, na.rm=TRUE))
## # A tibble: 1 x 1
##   `sum(weightlb, na.rm = TRUE)`
##                           <dbl>
## 1                      1549419.

5. summarize(): computes summary statistics

# mean weight of all the people in NHANES
NHANES %>% 
  mutate(weightlb = weight*2.2) %>% 
  summarise(mean(weightlb, na.rm=TRUE))
## # A tibble: 1 x 1
##   `mean(weightlb, na.rm = TRUE)`
##                            <dbl>
## 1                           156.

5. summarize() with group_by()

# mean weight of the people in NHANES, by education level
NHANES %>% 
  mutate(weightlb = weight*2.2) %>%  
  group_by(education) %>%  
  summarise(mean(weightlb, na.rm=TRUE))
## # A tibble: 6 x 2
##   education      `mean(weightlb, na.rm = TRUE)`
##   <fct>                                   <dbl>
## 1 8th Grade                               173. 
## 2 9 - 11th Grade                          181. 
## 3 High School                             183. 
## 4 Some College                            185. 
## 5 College Grad                            177. 
## 6 <NA>                                     91.7

4. arrange(): reorders the cases

# mean weight by education level, sorted from lowest to highest
NHANES %>% mutate(weightlb = weight*2.2) %>% 
  group_by(education) %>% 
  summarise(avewt = mean(weightlb, na.rm=TRUE)) %>% 
  arrange(avewt)
## # A tibble: 6 x 2
##   education      avewt
##   <fct>          <dbl>
## 1 <NA>            91.7
## 2 8th Grade      173. 
## 3 College Grad   177. 
## 4 9 - 11th Grade 181. 
## 5 High School    183. 
## 6 Some College   185.

ggplot2 (the package), ggplot() (the function)

goals

What I will try to do

  • give a tour of ggplot2

  • explain how to think about plots the ggplot2 way

  • prepare/encourage you to learn more later

What I can't do in one session

  • show every bell and whistle

  • make you an expert at using ggplot2

Set up

require(mosaic)
require(lubridate) # package for working with dates
data(Births78)     # restore fresh version of Births78
head(Births78, 3)
##         date births wday year month day_of_year day_of_month day_of_week
## 1 1978-01-01   7701  Sun 1978     1           1            1           1
## 2 1978-01-02   7527  Mon 1978     1           2            2           2
## 3 1978-01-03   8825  Tue 1978     1           3            3           3

The grammar of graphics

geom: the geometric "shape" used to display data (glyph)

  • bar, point, line, ribbon, text, etc.

aesthetic: an attribute controlling how geom is displayed

  • x position, y position, color, fill, shape, size, etc.

scale: conversion of raw data to visual display

  • particular assignment of colors, shapes, sizes, etc.

guide: helps user convert visual data back into raw data (legends, axes)

stat: a transformation applied to data before geom gets it

  • example: histograms work on binned data

How do we make this plot?

Two Questions:

  1. What do we want R to do? (What is the goal?)

  2. What does R need to know?

How do we make this plot?

Two Questions:

  1. Goal: scatterplot = a plot with points

  2. What does R need to know?

    • data source: Births78

    • aesthetics:

      • date -> x
      • births -> y
      • default color (same for all points)

How do we make this plot?

  1. Goal: scatterplot = a plot with points

    • ggplot() + geom_point()
  2. What does R need to know?

    • data source: data = Births78

    • aesthetics: aes(x = date, y = births)

first option

ggplot(data = Births78, aes(x = date, y = births)) + 
  geom_point()

second option

ggplot() +
  geom_point(data = Births78, aes(x = date, y = births))  

How do we make this plot?

What has changed?

  • new aesthetic: mapping color to day of week

Adding day of week to the data set

The wday() function in the lubridate package computes the day of the week from a date.

Births78 <-  
  Births78 %>% 
  mutate(wday = wday(date, label = TRUE))
ggplot(data = Births78) +
  geom_point(aes(x = date, y = births, color = wday))
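
A quick sketch of what `wday()` returns, using the first date in `Births78`. The numeric coding shown assumes lubridate's default week start (Sunday).

```r
library(lubridate)

d <- as.Date("1978-01-01")  # first date in Births78, a Sunday

day_number <- wday(d)                # numeric day of week; 1 = Sunday by default
day_label  <- wday(d, label = TRUE)  # ordered factor with abbreviated day names

day_number
day_label
```

Because the labelled version is an ordered factor, mapping it to color gives the days a sensible order in the legend.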

How do we make this plot?

This time we use lines instead of dots

ggplot(data = Births78) +
  geom_line(aes(x = date, y = births, color = wday)) 

How do we make this plot?

This time we have two layers, one with points and one with lines

ggplot(data = Births78, 
       aes(x = date, y = births, color = wday)) + 
  geom_point() + geom_line()

  • The layers are placed one on top of the other: the points are below and the lines are above.

  • data and aes specified in ggplot() affect all geoms

Alternative Syntax

Births78 %>% 
  ggplot(aes(x = date, y = births, color = wday)) + 
  geom_point() + 
  geom_line()

What does this do?

Births78 %>%
  ggplot(aes(x = date, y = births, color = "navy")) + 
  geom_point()  

This is mapping the color aesthetic to a new variable with only one value ("navy").
So all the dots get set to the same color, but it's not navy.

Setting vs. Mapping

If we want to set the color to be navy for all of the dots, we do it this way:

Births78 %>%
  ggplot(aes(x = date, y = births)) +   # map these 
  geom_point(color = "navy")        # set this

  • Note that color = "navy" is now outside of the aesthetics list. That's how ggplot2 distinguishes between mapping and setting.
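
One way to see the difference is to ask ggplot2 which colors each version actually draws with. This is a sketch on made-up data; `layer_data()` returns the computed values for a layer.

```r
library(ggplot2)

df <- data.frame(x = 1:3, y = c(2, 4, 3))  # hypothetical toy data

# Mapping: "navy" is treated as a data value and sent through the color scale
mapped <- ggplot(df, aes(x = x, y = y, color = "navy")) + geom_point()

# Setting: "navy" is passed to the geom directly, outside aes()
fixed <- ggplot(df, aes(x = x, y = y)) + geom_point(color = "navy")

mapped_colours <- unique(layer_data(mapped)$colour)  # a single default palette color
fixed_colours  <- unique(layer_data(fixed)$colour)   # the literal value "navy"

mapped_colours
fixed_colours
```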

How do we make this plot?

Births78 %>%
  ggplot(aes(x = date, y = births)) + 
  geom_line(aes(color = wday)) +       # map color here
  geom_point(color = "navy")           # set color here

  • ggplot() establishes the default data and aesthetics for the geoms, but each geom may change these defaults.

  • good practice: put into ggplot() the things that affect all (or most) of the layers; rest in geom_blah()
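
A layer can override the default data as well as the default aesthetics. The sketch below, on made-up data, draws a line through every row but adds points only for a subset; the threshold of 8 is arbitrary.

```r
library(ggplot2)

# Hypothetical toy data
df <- data.frame(day = 1:10, births = c(8, 9, 9, 7, 6, 9, 9, 10, 7, 6))

p <- ggplot(df, aes(x = day, y = births)) +   # defaults for every layer
  geom_line() +                               # inherits data and aesthetics
  geom_point(data = subset(df, births < 8),   # this layer swaps in its own data
             color = "red")

line_rows  <- nrow(layer_data(p, 1))  # rows drawn by the line layer
point_rows <- nrow(layer_data(p, 2))  # rows drawn by the point layer

c(line_rows, point_rows)
```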

Other geoms

apropos("^geom_") %>% head(21)
 [1] "geom_abline"     "geom_area"       "geom_ash"       
 [4] "geom_bar"        "geom_barh"       "geom_bin2d"     
 [7] "geom_blank"      "geom_boxplot"    "geom_boxploth"  
[10] "geom_col"        "geom_colh"       "geom_contour"   
[13] "geom_count"      "geom_crossbar"   "geom_crossbarh" 
[16] "geom_curve"      "geom_density"    "geom_density_2d"
[19] "geom_density2d"  "geom_dotplot"    "geom_errorbar"  

help pages will tell you their aesthetics, default stats, etc.

?geom_area             # for example

Let's try geom_area

Births78 %>%
  ggplot(aes(x = date, y = births, fill = wday)) + 
  geom_area()

This is not a good plot

  • overplotting is hiding much of the data
  • extending y-axis to 0 may or may not be desirable.

Side note: what makes a plot good?

Most (all?) graphics are intended to help us make comparisons

  • How does something change over time?
  • Do my treatments matter? How much?
  • Do men and women respond the same way?

Key plot metric: Does my plot make the comparisons I am interested in

  • easily, and
  • accurately?

Time for some different data

HELPrct: Health Evaluation and Linkage to Primary care randomized clinical trial

?HELPrct

Subjects admitted for treatment for addiction to one of three substances.

Why are these people in the study?

HELPrct %>% 
  ggplot(aes(x = substance)) + 
  geom_bar()

  • Hmm. What's up with y?

    • stat_count() is being applied to the data before the geom_bar() gets to do its thing. Counting the cases in each category creates the y values.

Data Flow

\[ \mbox{org data} \stackrel{\mbox{\ stat\ }}{\longrightarrow} \mbox{statified} \stackrel{\mbox{aesthetics}}{\longrightarrow} \mbox{aesthetic data} \stackrel{\mbox{\ scales\ }}{\longrightarrow} \mbox{scaled data} \]

Simplifications:

  • Aesthetics get computed twice, once before the stat and again after. Examples: bar charts, histograms

  • We need to look at the aesthetics to figure out which variable to bin

    • then the stat does the binning
    • bin counts become part of the aesthetics for geom: y = ..count..
  • This process happens in each layer

  • stat_identity() is the "do nothing" stat.
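
The data flow described above can be inspected with `layer_data()`, which returns a layer's data after the stat has run and the aesthetics have been filled in. A minimal sketch on a made-up `substance` column; in current ggplot2 the default stat for `geom_bar()` is `stat_count()`.

```r
library(ggplot2)

# Hypothetical toy data: six subjects, three substances
df <- data.frame(
  substance = c("alcohol", "cocaine", "alcohol", "heroin", "alcohol", "cocaine")
)

p <- ggplot(df, aes(x = substance)) + geom_bar()

bar_data <- layer_data(p)

bar_data$count                  # computed by the stat, one value per bar
bar_data$y                      # the y aesthetic the geom actually draws
as.vector(table(df$substance))  # the same counts computed by hand
```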

How old are people in the HELP study?

HELPrct %>% 
  ggplot(aes(x = age)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice the messages

  • stat_bin: Histograms are not mapping the raw data but binned data.
    stat_bin() performs the data transformation.

  • binwidth: a default binwidth has been selected, but we should really choose our own.

Setting the binwidth manually

HELPrct %>% 
  ggplot(aes(x = age)) + 
  geom_histogram(binwidth = 2)

How old are people in the HELP study? – Other geoms

HELPrct %>% 
  ggplot(aes(x = age)) + 
  geom_freqpoly(binwidth = 2)

HELPrct %>% 
  ggplot(aes(x = age)) + 
  geom_density()

Selecting stat and geom manually

Every geom comes with a default stat

  • for simple cases, the stat is stat_identity() which does nothing
  • we can mix and match geoms and stats however we like

HELPrct %>% 
  ggplot(aes(x = age)) + 
  geom_line(stat = "density")

Selecting stat and geom manually

Every stat comes with a default geom, every geom with a default stat

  • we can specify stat instead of geom, if we prefer
  • we can mix and match geoms and stats however we like

HELPrct %>% 
  ggplot(aes(x = age)) + 
  stat_density( geom = "line")
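
The two spellings should describe the same layer. As a rough check on simulated ages (not the HELPrct data), the sketch below compares the curves they compute:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(age = rnorm(100, mean = 35, sd = 8))  # simulated ages

p_geom <- ggplot(df, aes(x = age)) + geom_line(stat = "density")
p_stat <- ggplot(df, aes(x = age)) + stat_density(geom = "line")

# Both layers pair the line geom with the density stat
class(p_geom$layers[[1]]$geom)[1]
class(p_stat$layers[[1]]$stat)[1]

# The computed density curves agree
same_curve <- isTRUE(all.equal(layer_data(p_geom)$y, layer_data(p_stat)$y))
same_curve
```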

More combinations

HELPrct %>% 
  ggplot(aes(x = age)) + 
  geom_point(stat = "bin", binwidth = 3) + 
  geom_line(stat = "bin", binwidth = 3)  

HELPrct %>% 
  ggplot(aes(x = age)) + 
  geom_area(stat = "bin", binwidth = 3)  

HELPrct %>% 
  ggplot(aes(x = age)) + 
  geom_point(stat = "bin", binwidth = 3, aes(size = ..count..)) +
  geom_line(stat = "bin", binwidth = 3) 

Your turn: How much do they drink? (i1)

Create a plot that shows the distribution of the average daily alcohol consumption in the past 30 days (i1).

How much do they drink? (i1)

HELPrct %>% 
  ggplot(aes(x = i1)) + geom_histogram()

HELPrct %>% 
  ggplot(aes(x = i1)) + geom_area(stat = "density")

Covariates: Adding in more variables

Q. How does alcohol consumption (or age, your choice) differ by sex and substance (alcohol, cocaine, heroin)?

Decisions:

  • How will we display the variables: i1 (or age), sex, substance

  • What comparisons are we most interested in?

Give it a try.

  • Note: I'm cheating a bit. You may want to do some things I haven't shown you yet. (Feel free to ask.)

Covariates: Adding in more variables

Using color and linetype:

HELPrct %>% 
  ggplot(aes(x = i1, color = substance, linetype = sex)) + 
  geom_line(stat = "density")

Using color and facets

HELPrct %>% 
  ggplot(aes(x = i1, color = substance)) + 
  geom_line(stat = "density") + facet_grid( . ~ sex )

Boxplots

Boxplots use stat_boxplot() which computes a five-number summary (roughly the minimum, the three quartiles, and the maximum of the data) and uses them to define a "box" and "whiskers". The quantitative variable must be y, and there must be an additional x variable.
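
To see the summary the boxplot stat computes, the sketch below compares it with `quantile()` on a made-up vector that has one extreme value. The whiskers stop at the most extreme points within 1.5 IQR of the box, so they need not reach the minimum or maximum.

```r
library(ggplot2)

v  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)  # toy data with one extreme value
df <- data.frame(group = "a", value = v)

p   <- ggplot(df, aes(x = group, y = value)) + geom_boxplot()
box <- layer_data(p)

quantile(v, c(0.25, 0.5, 0.75))      # quartiles computed directly
c(box$lower, box$middle, box$upper)  # box edges and median from the stat
c(box$ymin, box$ymax)                # whisker ends
box$outliers[[1]]                    # points drawn individually
```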

HELPrct %>% 
  ggplot(aes(x = substance, y = age, color = sex)) + 
  geom_boxplot()

Horizontal boxplots

Horizontal boxplots are obtained by flipping the coordinate system:

HELPrct %>% 
  ggplot(aes(x = substance, y = age, color = sex)) + 
  geom_boxplot() +
  coord_flip()

  • coord_flip() may be used with other plots as well to reverse the roles of x and y on the plot.

Give me some space

We've triggered a new feature: dodge (for dodging things left/right). We can control how much if we set the dodge manually.

HELPrct %>% 
  ggplot(aes(x = substance, y = age, color = sex)) + 
  geom_boxplot(position = position_dodge(width = 1)) 

Issues with bigger data

require(NHANES)
dim(NHANES)
## [1] 10000    76
NHANES %>%  ggplot(aes(x = height, y = weight)) +
  geom_point() + facet_grid( gender ~ pregnantnow )

  • Although we can see a generally positive association (as we would expect), the overplotting may be hiding information.

Using alpha (opacity)

One way to deal with overplotting is to set the opacity low.

NHANES %>% 
  ggplot(aes(x = height, y = weight)) +
  geom_point(alpha = 0.01) + facet_grid( gender ~ pregnantnow )

geom_density2d

Alternatively (or simultaneously) we might prefer a different geom altogether.

NHANES %>% 
  ggplot(aes(x = height, y = weight)) +
  geom_density2d() + facet_grid( gender ~ pregnantnow )

Multiple layers

ggplot( data = HELPrct, aes(x = sex, y = age)) +
  geom_boxplot(outlier.size = 0) +
  geom_jitter(alpha = .6) +
  coord_flip()

Multiple layers

ggplot( data = HELPrct, aes(x = sex, y = age)) +
  geom_boxplot(outlier.size = 0) +
  geom_point(alpha = .6, position = position_jitter(width = .1, height = 0)) +
  coord_flip()

Labeling

NHANES %>% 
  ggplot(aes(x = height, y = weight)) +
  geom_point() + facet_grid( gender ~ pregnantnow ) +
  labs(x = "height (cm)", y = "weight (kg)", 
       title = "weight vs height")

Things I haven't mentioned (much)

  • scales (fine tuning mapping from data to plot)

  • guides (so reader can map from plot to data)

  • coords (coord_flip() is good to know about)

  • themes (for customizing appearance)

require(ggthemes)
ggplot(data = Births78, aes(x = date, y = births)) +
       geom_point() + theme_wsj() # wall street journal

require(xkcd)
ggplot(data = Births78, aes(x = date, y = births, colour = wday)) +
       geom_smooth(se = FALSE) + theme_xkcd()

Things I haven't mentioned (much)

  • scales (fine tuning mapping from data to plot)

  • guides (so reader can map from plot to data)

  • coords (coord_flip() is good to know about)

  • themes (for customizing appearance)

  • position (position_dodge() can be used for side by side bars)

ggplot(data = HELPrct, 
       aes(x = substance, y = age, color = sex)) +
  geom_violin(position = position_dodge()) +
  geom_point(aes(color = sex, fill = sex), 
             position = position_jitterdodge()) 

Things I haven't mentioned (much)

  • scales (fine tuning mapping from data to plot)

  • guides (so reader can map from plot to data)

  • themes (for customizing appearance)

  • position (position_dodge(), position_jitterdodge(), position_stack(), etc.)

A little bit of everything

ggplot(data = HELPrct, 
       aes(x = substance, y = age, color = sex)) +
  geom_boxplot(position = position_dodge(width = 1)) +
  geom_point(aes(fill = sex), alpha = .5, 
    position = position_jitterdodge(dodge.width = 1)) + 
  facet_wrap(~homeless)

Want to learn more?

  • docs.ggplot2.org/

  • Winston Chang's R Graphics Cookbook

  • google whatever you want plus ggplot2

What's around the corner?

ggvis

  • dynamic graphics (brushing, sliders, tooltips, etc.)

  • similar structure to ggplot2 but different syntax and names

Dynamic documents

  • combination of RMarkdown, ggvis, and shiny

GitHub in R